Skip to content

Conversation

@maropu
Copy link
Member

@maropu maropu commented Oct 2, 2020

What changes were proposed in this pull request?

Support query auto cancelling when running too long on thriftserver.

This is the rework of #28991 and the credit should be the original author, @leoluan2009.

Closes #28991

Why are the changes needed?

For some cases, we use thriftserver as long-running applications.
Some times we want all the query need not to run more than given time.
In these cases, we can enable auto cancel for time-consumed query.Which can let us release resources for other queries to run.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests.

@SparkQA
Copy link

SparkQA commented Oct 2, 2020

Test build #129347 has finished for PR 29933 at commit 335bdd0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 2, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33957/

@SparkQA
Copy link

SparkQA commented Oct 2, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33957/

@SparkQA
Copy link

SparkQA commented Oct 2, 2020

Test build #129350 has finished for PR 29933 at commit c0318c9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver [WIP][SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver Oct 2, 2020
@SparkQA
Copy link

SparkQA commented Oct 2, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33961/

@SparkQA
Copy link

SparkQA commented Oct 2, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33961/

@SparkQA
Copy link

SparkQA commented Oct 3, 2020

Test build #129377 has finished for PR 29933 at commit eddfee9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [WIP][SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver Oct 3, 2020
@SparkQA
Copy link

SparkQA commented Oct 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33985/

@SparkQA
Copy link

SparkQA commented Oct 3, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33985/


if (queryTimeout > 0) {
Executors.newSingleThreadScheduledExecutor.schedule(new Runnable {
override def run(): Unit = cancel(OperationState.TIMEDOUT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a log?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean the log (that I added in L206) before starting the thread? Looks cancel itself already has a log in https://github.com/apache/spark/pull/29933/files#diff-72dcd8f81a51c8a815159fdf0332acdcR343 though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only know from the log in cancel that the query is canceled, but don't know it is timeout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to let user know it is canceled because of timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@SparkQA
Copy link

SparkQA commented Oct 4, 2020

Test build #129387 has finished for PR 29933 at commit 42991eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33994/

@SparkQA
Copy link

SparkQA commented Oct 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33994/

@maropu
Copy link
Member Author

maropu commented Oct 4, 2020

cc: @dongjoon-hyun

@SparkQA
Copy link

SparkQA commented Oct 4, 2020

Test build #129388 has finished for PR 29933 at commit 3c86e74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33995/

@SparkQA
Copy link

SparkQA commented Oct 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33995/

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-26533][SQL][test-hive1.2][test-hadoop2.7] Support query auto timeout cancel on thriftserver [SPARK-26533][SQL][test-hadoop2.7] Support query auto timeout cancel on thriftserver Oct 5, 2020
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu . -Phive-1.2 is removed finally. Please update this PR.

@maropu
Copy link
Member Author

maropu commented Oct 6, 2020

Sure, thanks, @dongjoon-hyun

@maropu maropu changed the title [SPARK-26533][SQL][test-hadoop2.7] Support query auto timeout cancel on thriftserver [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver Oct 6, 2020
@SparkQA
Copy link

SparkQA commented Oct 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34035/

@SparkQA
Copy link

SparkQA commented Oct 6, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34036/

@SparkQA
Copy link

SparkQA commented Oct 14, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34367/

@SparkQA
Copy link

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34381/

@SparkQA
Copy link

SparkQA commented Oct 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34381/

@SparkQA
Copy link

SparkQA commented Oct 15, 2020

Test build #129775 has finished for PR 29933 at commit 5e3c440.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Oct 20, 2020

Anyone could check this?

@maropu
Copy link
Member Author

maropu commented Oct 20, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Oct 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34638/

@SparkQA
Copy link

SparkQA commented Oct 20, 2020

Test build #130032 has finished for PR 29933 at commit 5e3c440.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Oct 20, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34638/

@SparkQA
Copy link

SparkQA commented Oct 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34641/

@SparkQA
Copy link

SparkQA commented Oct 20, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34641/

@SparkQA
Copy link

SparkQA commented Oct 20, 2020

Test build #130034 has finished for PR 29933 at commit 5e3c440.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Oct 22, 2020

kindly ping.

@AngersZhuuuu
Copy link
Contributor

LGTM

@maropu maropu closed this in d9ee33c Oct 22, 2020
@maropu
Copy link
Member Author

maropu commented Oct 22, 2020

Thanks, all~, Merged to master.

HyukjinKwon pushed a commit that referenced this pull request Dec 11, 2020
…ask on thriftserver

### What changes were proposed in this pull request?

This PR add a new config `spark.sql.thriftServer.forceCancel` to give user a way to interrupt task when cancel statement.

### Why are the changes needed?

After [#29933](#29933), we support cancel query if timeout, but the default behavior of `SparkContext.cancelJobGroups` won't interrupt task and just let task finish by itself. In some case it's dangerous, e.g., data skew or exists a heavily shuffle. A task will hold in a long time after do cancel and the resource will not release.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.

### How was this patch tested?

Add test.

Closes #30481 from ulysses-you/SPARK-33526.

Lead-authored-by: ulysses-you <[email protected]>
Co-authored-by: ulysses-you <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
yaooqinn pushed a commit to apache/kyuubi that referenced this pull request Mar 26, 2021
![ulysses-you](https://badgen.net/badge/Hello/ulysses-you/green) [![Closes #451](https://badgen.net/badge/Preview/Closes%20%23451/blue)](https://github.com/yaooqinn/kyuubi/pull/451) ![200](https://badgen.net/badge/%2B/200/red) ![17](https://badgen.net/badge/-/17/green) ![27](https://badgen.net/badge/commits/27/yellow) ![Target Issue](https://badgen.net/badge/Missing/Target%20Issue/ff0000) ![Test Plan](https://badgen.net/badge/Missing/Test%20Plan/ff0000) ![Feature](https://badgen.net/badge/Label/Feature/) [<img width="16" alt="Powered by Pull Request Badge" src="https://user-images.githubusercontent.com/1393946/111216524-d2bb8e00-85d4-11eb-821b-ed4c00989c02.png">](https://pullrequestbadge.com/?utm_medium=github&utm_source=yaooqinn&utm_campaign=badge_info)<!-- PR-BADGE: PLEASE DO NOT REMOVE THIS COMMENT -->

<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/yaooqinn/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->
Manual cherry-pick some Spark patch into Kyuubi.
1. [Support query auto timeout cancel on thriftserver](apache/spark#29933)
2. [Add config to control if cancel invoke interrupt task on thriftserver](apache/spark#30481)

In order to keep backward with early Spark version, we hard code the config key instead of refer to Spark SQLConf.

Note that, the exists timeout of operator (`kyuubi.operation.idle.timeout`) is to cancel that client has no access with engine. That said if a query run a long time and the client is alive, the query would not be cancelled. Then the new added config `spark.sql.thriftServer.queryTimeout` can handle this case.

### _How was this patch tested?_
Add new test.

Closes #451 from ulysses-you/query-timeout.

212f579 [ulysses-you] docs
9206538 [ulysses-you] empty flaky test
ddab9bf [ulysses-you] flaty test
1da02a0 [ulysses-you] flaty test
edfadf1 [ulysses-you] nit
3f9920b [ulysses-you] address comment
9492c48 [ulysses-you] correct timeout
5df997e [ulysses-you] nit
2124952 [ulysses-you] address comment
192fdcc [ulysses-you] fix tets
d684af6 [ulysses-you] global config
1d1adda [ulysses-you] empty
967a63e [ulysses-you] correct import
128948e [ulysses-you] add session conf in session
144d51b [ulysses-you] fix
a90248b [ulysses-you] unused import
c90386f [ulysses-you] timeout move to operation manager
d780965 [ulysses-you] update docs
a5f7138 [ulysses-you] fix test
f7c7308 [ulysses-you] config name
7f3fb3d [ulysses-you] change conf place
97a011e [ulysses-you] unnecessary change
0953a76 [ulysses-you] move test
38ac0c0 [ulysses-you] Merge branch 'master' of https://github.com/yaooqinn/kyuubi into query-timeout
71bea97 [ulysses-you] refector implementation
35ef6f9 [ulysses-you] update conf
0cad8e2 [ulysses-you] Support query auto timeout cancel on thriftserver

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
yaooqinn pushed a commit to apache/kyuubi that referenced this pull request Mar 26, 2021
![ulysses-you](https://badgen.net/badge/Hello/ulysses-you/green) [![Closes #451](https://badgen.net/badge/Preview/Closes%20%23451/blue)](https://github.com/yaooqinn/kyuubi/pull/451) ![200](https://badgen.net/badge/%2B/200/red) ![17](https://badgen.net/badge/-/17/green) ![27](https://badgen.net/badge/commits/27/yellow) ![Target Issue](https://badgen.net/badge/Missing/Target%20Issue/ff0000) ![Test Plan](https://badgen.net/badge/Missing/Test%20Plan/ff0000) ![Feature](https://badgen.net/badge/Label/Feature/) [<img width="16" alt="Powered by Pull Request Badge" src="https://user-images.githubusercontent.com/1393946/111216524-d2bb8e00-85d4-11eb-821b-ed4c00989c02.png">](https://pullrequestbadge.com/?utm_medium=github&utm_source=yaooqinn&utm_campaign=badge_info)<!-- PR-BADGE: PLEASE DO NOT REMOVE THIS COMMENT -->

<!--
Thanks for sending a pull request!

Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
  2. If the PR is related to an issue in https://github.com/yaooqinn/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->

### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
  1. If you add a feature, you can talk about the use case of it.
  2. If you fix a bug, you can clarify why it is a bug.
-->
Manual cherry-pick some Spark patch into Kyuubi.
1. [Support query auto timeout cancel on thriftserver](apache/spark#29933)
2. [Add config to control if cancel invoke interrupt task on thriftserver](apache/spark#30481)

In order to keep backward with early Spark version, we hard code the config key instead of refer to Spark SQLConf.

Note that, the exists timeout of operator (`kyuubi.operation.idle.timeout`) is to cancel that client has no access with engine. That said if a query run a long time and the client is alive, the query would not be cancelled. Then the new added config `spark.sql.thriftServer.queryTimeout` can handle this case.

### _How was this patch tested?_
Add new test.

Closes #451 from ulysses-you/query-timeout.

212f579 [ulysses-you] docs
9206538 [ulysses-you] empty flaky test
ddab9bf [ulysses-you] flaty test
1da02a0 [ulysses-you] flaty test
edfadf1 [ulysses-you] nit
3f9920b [ulysses-you] address comment
9492c48 [ulysses-you] correct timeout
5df997e [ulysses-you] nit
2124952 [ulysses-you] address comment
192fdcc [ulysses-you] fix tets
d684af6 [ulysses-you] global config
1d1adda [ulysses-you] empty
967a63e [ulysses-you] correct import
128948e [ulysses-you] add session conf in session
144d51b [ulysses-you] fix
a90248b [ulysses-you] unused import
c90386f [ulysses-you] timeout move to operation manager
d780965 [ulysses-you] update docs
a5f7138 [ulysses-you] fix test
f7c7308 [ulysses-you] config name
7f3fb3d [ulysses-you] change conf place
97a011e [ulysses-you] unnecessary change
0953a76 [ulysses-you] move test
38ac0c0 [ulysses-you] Merge branch 'master' of https://github.com/yaooqinn/kyuubi into query-timeout
71bea97 [ulysses-you] refector implementation
35ef6f9 [ulysses-you] update conf
0cad8e2 [ulysses-you] Support query auto timeout cancel on thriftserver

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit fecdba3)
Signed-off-by: Kent Yao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants